{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "m4QTBIbYHLSe"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.integrate as integrate\n",
    "import scipy.optimize as optimize\n",
    "import scipy.stats as stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "sSkiWaFIHLSj"
   },
   "source": [
    "# Regression without knowing the underlying model\n",
    "\n",
    "[Video timestamp](https://youtu.be/jReGEZXq4Ac?t=2450)\n",
    "\n",
    "In the first part of the lesson, we studied linear regression and made a crucial assumption: to estimate the parameter, we knew that the underling model was linear. What happens if we remove this assumption?\n",
    "\n",
    "To understand it, we will still look at a regression task but we relax this assumption as follows: we know that the underlying model is a polynomial with additive noise:\n",
    "$$\n",
    "y(i) = p^*(x(i)) + \\epsilon(i),\n",
    "$$\n",
    "where $p^*$ is an unknown polynomial that we need to estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ztQcfaqNHLSk"
   },
   "outputs": [],
   "source": [
    "t = 0.5\n",
    "sigma = 0.2\n",
    "def model(x):\n",
    "    return t*x**3 + sigma*np.random.randn()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "jqoSViqGHLSn"
   },
   "outputs": [],
   "source": [
    "D = np.random.uniform(-2,2,25)\n",
    "D = np.sort(D)\n",
    "Y = [model(d) for d in D]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "B_FdGldmHLSq"
   },
   "outputs": [],
   "source": [
    "plt.scatter(D,Y)\n",
    "D_plot=  np.arange(-2,2.1,0.015)\n",
    "plt.plot(D_plot,t*D_plot**3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "r9yGENujHLSu"
   },
   "source": [
    "## What goes wrong with our previous approach?\n",
    "\n",
    "[Video timestamp](https://youtu.be/jReGEZXq4Ac?t=2578)\n",
    "\n",
    "We can define a quadratic loss:\n",
    "$$\n",
    "J(p) = \\frac{1}{m}\\sum_{i=1}^m \\left(y(i)-p(x(i)) \\right)^2,\n",
    "$$\n",
    "and minimizes it among all polynomials."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "omGkOAJUHLSv"
   },
   "outputs": [],
   "source": [
    "z = np.polyfit(D, Y,len(D)-1)\n",
    "p = np.poly1d(z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "2fUYwZx0HLSx"
   },
   "outputs": [],
   "source": [
    "plt.scatter(D,Y)\n",
    "plt.plot(D_plot,t*D_plot**3)\n",
    "plt.ylim(-4,4)\n",
    "plt.plot(D_plot,p(D_plot))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "83Z2JLvGHLS0"
   },
   "source": [
    " Great, we achieve a loss of 0 but clearly our solution does not seem right!\n",
    " \n",
    " Let cheat a bit a see what happens if we are looking at a solution with the right degree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "9fhgKgzkHLS1"
   },
   "outputs": [],
   "source": [
    "z = np.polyfit(D, Y,5)\n",
    "p = np.poly1d(z)\n",
    "plt.scatter(D,Y)\n",
    "plt.plot(D_plot,t*D_plot**3)\n",
    "plt.ylim(-4,4)\n",
    "plt.plot(D_plot,p(D_plot))             "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "NFYJEc4SHLS4"
   },
   "source": [
    "[Video timestamp](https://youtu.be/jReGEZXq4Ac?t=2720)\n",
    "\n",
    "OK, this looks much better, but we cheated! Indeed, you can modify the degree for your polynomial fit and see that it is not easy to decide between degrees 3,4,5...\n",
    "\n",
    "We can now formalize our problem as follows. Given the points $(x(i),y(i))$, we need to do two things:\n",
    " - decide on the degree of the polynomial $p^*$;\n",
    " - once the degree is fixed, estimate the parameters of the polynomial.\n",
    " \n",
    "One natural way to deal with this new formulation of the problem is to check all possible degrees and make an estimation of the parameters for each possible choice. But then, we need to decide which degree to select. In order to do that, we will split the dataset in a training set and a validation set. We will use the training set to estimate the parameters of the polynomial for all possible degrees. To decide which degree we should select, we will compute the loss of the obtained polynomial on the validation set and pick the one with minimal validation loss.\n",
    "\n",
    "Let see if this works?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vxcRuY1LHLS4"
   },
   "outputs": [],
   "source": [
    "D_train = D[1::2]\n",
    "Y_train = Y[1::2]\n",
    "D_val = D[::2]\n",
    "Y_val = Y[::2]\n",
    "plt.scatter(D_train,Y_train)\n",
    "plt.scatter(D_val, Y_val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "A53Koh5nHLS7"
   },
   "outputs": [],
   "source": [
    "def get_error(deg):\n",
    "    val_error = np.zeros(deg)\n",
    "    train_error = np.zeros(deg)\n",
    "    for i in range(deg):\n",
    "        z = np.polyfit(D_train, Y_train, i)\n",
    "        p = np.poly1d(z)\n",
    "        train_error[i] = (np.mean((p(D_train)-Y_train)**2))\n",
    "        val_error[i] = (np.mean((p(D_val)-Y_val)**2))\n",
    "    return train_error, val_error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tgjMjRyjHLS9"
   },
   "outputs": [],
   "source": [
    "train_error, val_error = get_error(len(D_train)-1)\n",
    "plt.figure(figsize=(14,7))\n",
    "plt.plot(train_error, label='Training error')\n",
    "plt.ylim(0, 3)\n",
    "plt.plot(val_error, label='Validation error')\n",
    "plt.xlabel(\"Degree\")\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-QqI82iwHLTA"
   },
   "source": [
    "We see that the error on the training set is decreasing until it reaches 0 when the polynomial is able to inpterpolate all the points of the training set.\n",
    "\n",
    "The error on the validation set is first decreasing as the training error but then starts to increase again. This is because the polynomial interpolating through the points of the training set is now missing a lot of points of the validation set, as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "ZW9CpK3gHLTA"
   },
   "outputs": [],
   "source": [
    "z = np.polyfit(D_train, Y_train, len(D_train)-1)\n",
    "p = np.poly1d(z)\n",
    "plt.scatter(D_val, Y_val)\n",
    "plt.scatter(D_train,Y_train)\n",
    "plt.plot(D_plot,t*D_plot**3)\n",
    "plt.ylim(-4,4)\n",
    "plt.plot(D_plot,p(D_plot)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VyzU0QgiHLTD"
   },
   "source": [
    "[Video timestamp](https://youtu.be/jReGEZXq4Ac?t=2935)\n",
    "\n",
    "To summarize:\n",
    " - for low degrees, our parametric model is not expressive enough to capture the true model resulting in a high error both on the training and validation set.\n",
    " - for high degrees, our model becomes very expressive and start to actually fit the noise in the dataset, resulting in a low error on the training set and a high error on the validation set.\n",
    " \n",
    "To formalize a bit what happens, we need to introduce the notion of risk. For an estimator $f:\\mathbb{R}\\to\\mathbb{R}$, we define the risk as:\n",
    "$$\n",
    "\\mathcal{R}(f) = \\mathbb{E}\\left[(f(X)-Y)^2\\right],\n",
    "$$\n",
    "where the average is taken over randomness of $(X,Y)$. In our case, we simulate the true model with $X\\sim Unif[-2,2]$ and $Y = 0.5*X^3+\\sigma \\epsilon$, where $\\epsilon\\sim \\mathcal{N}(0,1)$.\n",
    "\n",
    "We also define $\\mathcal{H}_k$ as the set of polynomial of maximum degree $k$. $\\mathcal{H}_k$ is the hypothesis space. We denote by $H_\\infty$ the set of all polynomials.\n",
    "\n",
    "Our goal is to find:\n",
    "$$\n",
    "p^* = \\arg\\min_{p\\in \\mathcal{H}_\\infty}\\mathcal{R}(p).\n",
    "$$\n",
    "\n",
    "In our case, we can compute the risk:\n",
    "\\begin{eqnarray*}\n",
    "\\min_{p\\in \\mathcal{H}_\\infty}\\mathcal{R}(p) &=& \\mathbb{E}\\left[(Y-\\mathbb{E}[Y|X])^2\\right]\\\\\n",
    "&=& \\mathbb{E}\\left[ (Y-0.5 X^3)^2\\right]\\\\\n",
    "&=& \\sigma^2\n",
    "\\end{eqnarray*}\n",
    "\n",
    "But in practice, we do not have access to the true underlying model defining the distribution of $(X,Y)$, hence we are not able to evaluate the average defining the risk.\n",
    "\n",
    "Hence, we define the empirical risk:\n",
    "$$\n",
    "\\hat{\\mathcal{R}}(p) = \\frac{1}{m}\\sum_{i=1}^m \\left(y(i)-p((x(i))\\right)^2,\n",
    "$$\n",
    "which is an approximation of the true risk.\n",
    "To be more precise, we defined two differetn empirical risks: for $\\hat{\\mathcal{R}}_{train}(p)$, the average is taken over the training set and for $\\hat{\\mathcal{R}}_{val}(p)$, the average is taken over the validation set.\n",
    "\n",
    "Now we define the polynomial of degree at most $k$ minimizing the empirical risk on the training set:\n",
    "$$\n",
    "\\hat{p}_k = \\arg\\min_{p\\in \\mathcal{H}_k}\\hat{\\mathcal{R}}_{train}(p).\n",
    "$$\n",
    "\n",
    "The training error above is given by $\\hat{\\mathcal{R}}_{train}(\\hat{p}_k)$ and the validation error by $\\hat{\\mathcal{R}}_{val}(\\hat{p}_k)$.\n",
    "\n",
    "Since the data points in the validation set are not used for the polynomial fit, we have $\\hat{\\mathcal{R}}_{val}(\\hat{p}_k)\\approx \\mathcal{R}(\\hat{p}_k)$.\n",
    "\n",
    "Unfortunately, what we would like to compute is\n",
    "$$\n",
    "p^*_k = \\arg\\min_{p\\in \\mathcal{H}_k}\\mathcal{R}(p).\n",
    "$$\n",
    "Note that $\\mathcal{R}(p^*_k) \\downarrow \\mathcal{R}(p^*)$ as $k\\to \\infty$.\n",
    "Unfortunately, our experiment above shows us that $p^*_k\\neq \\hat{p}_k$, especially for large values of $k$.\n",
    "\n",
    "[Video timestamp](https://youtu.be/jReGEZXq4Ac?t=3280)\n",
    "\n",
    "In all cases, we can decompose the risk of our estimator in the following non-negative terms:\n",
    "$$\n",
    "\\mathcal{R}(\\hat{p}_k) = \\underbrace{\\mathcal{R}(\\hat{p}_k)-\\mathcal{R}(p^*_k)}_{(1)} + \\underbrace{\\mathcal{R}(p^*_k)-\\mathcal{R}(p^*)}_{(2)}+\\mathcal{R}(p^*).\n",
    "$$\n",
    "\n",
    "The first term is called the **estimation error**, the second term is called the **approximation error** and the last term $\\mathcal{R}(p^*)$ is the true risk.\n",
    "\n",
    "Clearly, as $k\\to \\infty$, the approximation error (2) vanishes as our model becomes more and more expressive. In our case, the approximation error is 0 for $k\\geq 3$. But the estimation error (1) will unfortunately grows with $k$. Going back to our empirical findings above, we see that for low values of $k$, $\\mathcal{R}(\\hat{p}_k)$ is high because the approximation error (2) is high and the for high values of $k$, $\\mathcal{R}(\\hat{p}_k)$ is high because the estimation error (1) is high. In practice, we take the minimum of this curve as an estimate of $\\mathcal{R}(p^*)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "K9SqInq7HLTE"
   },
   "outputs": [],
   "source": [
    "def get_risk(deg):\n",
    "    risk = np.zeros(deg)\n",
    "    for i in range(deg):\n",
    "        z = np.polyfit(D_train, Y_train, i)\n",
    "        p = np.poly1d(z)\n",
    "        fun_int= lambda e,x : (p(x)-t*x**3 - sigma*e)**2*stats.norm.pdf(e)/4\n",
    "        risk[i] = integrate.dblquad(fun_int,-2,2,lambda x: -10, lambda x: 10)[0]\n",
    "    return risk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "tG_A0B57HLTG"
   },
   "outputs": [],
   "source": [
    "risk = get_risk(len(D_train)-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "EOD6QWM3HLTI"
   },
   "outputs": [],
   "source": [
    "def opti_risk(x):\n",
    "    if int(x) == 0:\n",
    "        return t**2*2**6/7+sigma**2\n",
    "    elif int(x) == 1:\n",
    "        return t**2*2**6/7+sigma**2-t**2*2**6/5**2*3\n",
    "    elif int(x) == 2:\n",
    "        pass# left as an exercise!\n",
    "    else:\n",
    "        return sigma**2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "seXqlpgvHLTK"
   },
   "outputs": [],
   "source": [
    "deg = [0,1,3,4,5,6,7,8,9,10]\n",
    "plt.figure(figsize=(14,7))\n",
    "plt.ylim(0, 3)\n",
    "plt.plot(deg, [risk[int(d)]-opti_risk(d) for d in deg], color='red',label='estimation error')\n",
    "plt.plot(deg, [opti_risk(d)-sigma**2 for d in deg],color='green',label='approximation error')\n",
    "plt.plot(val_error, label='empirical validation error')\n",
    "plt.xlabel(\"Degree\")\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "VYrqxakxHLTM"
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(14,7))\n",
    "plt.ylim(0, 3)\n",
    "plt.plot(risk, color='red',label='risk of esimator $\\mathcal{R}(\\hat{p}_k)$')\n",
    "plt.plot(val_error, label='empirical validation error $\\hat{\\mathcal{R}}_{val}(\\hat{p}_k)$')\n",
    "plt.xlabel(\"Degree\")\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "H3PkcbY6HLTO"
   },
   "source": [
    "[![Dataflowr](https://raw.githubusercontent.com/dataflowr/website/master/_assets/dataflowr_logo.png)](https://dataflowr.github.io/website/)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "include_colab_link": true,
   "name": "polynomial_regression.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
